Abstract

We propose a new self-organizing algorithm for a feed-forward network, inspired
by an electrostatic problem, that turns out to be intimately related to information
maximization.

Keywords: feed forward, mutual information, relaxation methods. 



1 Introduction 

In this paper we present a new self-organizing algorithm for a layer of h continuous
Perceptrons derived from the electrostatic problem of free electrical charges in a
conductor. The algorithm is general and maximizes information.

The idea is simple: we use a layer of continuous Perceptrons to map the inputs
to point-like electrical charges that we imagine free to move within a hypercube
in multi-dimensional space, and we let them evolve, or better relax, under Coulomb
repulsion until they settle in the minimal energy configuration. For this reason we
named this algorithm "Neural Relax", NR in what follows.

We show that this is sufficient to obtain binary and statistically independent
data as a natural consequence of the algorithm itself; in addition, by fixing the
dimension of the hypercube, one can freely adjust the rate of dimensional reduction.
From a theoretical point of view, we show that, in the simple one-dimensional case,
this algorithm provides the maximum-information solution to the problem, and thus
the learning rules turn out to be equal to those obtained by Bell and Sejnowski for their
Independent Component Analysis (ICA) [ ], which gives a completely different
interpretation of the ICA algorithm. In the general multi-dimensional case, we show that
NR gives a pure Hebbian rule and is also well suited to inject some redundancy that
can subsequently be used to perform error correction on the processed patterns.

The paper is structured as follows: in Section 2 we briefly describe our network.
In Section 3 we present the real physical problem we refer to, namely a system of
point-like charges confined in a cube, and link it to our problem and to previous
works using Coulomb-like forces in neural networks. Then we present a theoretical
analysis for the one-dimensional case (Section 4) and the general multi-dimensional
case (Section 5). We conclude with some preliminary computational results: to test
our algorithm we tackle the problem of preprocessing real-world binary images to
make them unbiased, uncorrelated and binary.

2 A layer of Perceptrons 

We consider a layer of h Perceptrons with n inputs and tanh() transfer function;
given an input $x \in \mathbb{R}^n$ each Perceptron gives the output

$$ y_i = \tanh(w_i \cdot x) = \tanh\Big( \sum_{j} w_{ij} x_j \Big) , \qquad i = 1, \dots, h \qquad (1) $$



and Figure 1 schematically illustrates the architecture of this network. We stretch
the notation slightly and write the h equations (1) in terms of the weight matrix W:

$$ y = \tanh(W x) \qquad (2) $$



This is a common, well-studied network that, among other things, can be used to
approximate any continuous function, since the transfer function tanh(x) is bounded
in $(-1, 1)$, non-constant, smooth and monotone [ ]. We will assume that the inputs
follow a distribution p(x) and that there is no noise; usually we will consider
binary inputs $x \in \{\pm 1\}^n$. We will focus on the case of binary outputs
$y \in \{\pm 1\}^h$, that is the limit of the continuous case (2) when the argument
is large (footnote 1).




Figure 1: Schematic illustration of the network: an input $x_\nu$ is fed to an input layer of
n neurons, connected to h neurons that produce the output $y_\nu$. The weight matrix W
also contains the thresholds, which appear as weights of a fictitious 0-th input clamped
at 1.
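As a concrete illustration of equation (2), the following minimal Python sketch evaluates the layer for one input. The names `layer_output` and `with_bias`, the example weights, and the convention that the clamped 0-th input is stored as the first element of the input vector are assumptions made here for illustration only.

```python
import math

def layer_output(W, x):
    """Return y = tanh(W x) for a weight matrix W (a list of h rows) and an input x.

    Assumption: x[0] is the fictitious input clamped at 1, so that W[i][0]
    plays the role of the threshold of neuron i.
    """
    return [math.tanh(sum(w_ij * x_j for w_ij, x_j in zip(row, x))) for row in W]

def with_bias(pattern):
    """Prepend the clamped 0-th input to a pattern."""
    return [1.0] + list(pattern)

# Hypothetical example: h = 2 neurons, n = 2 binary inputs plus the clamped input.
W = [[0.0, 0.5, -0.5], [0.2, 1.0, 1.0]]
x = with_bias([1, -1])
y = layer_output(W, x)

# Scaling the weights makes the arguments large and the outputs nearly binary.
y_sharp = layer_output([[20.0 * w for w in row] for row in W], x)
```

Multiplying all weights by a large factor pushes every output towards the sign of its argument, which is the binary limit mentioned in the text.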



Nadal and Parga [13] studied this network when $y = \mathrm{sgn}(W x)$ in the framework
of information theory. They showed that the information capacity C that can be
conveyed by h binary neurons is bounded by h, i.e.

$$ C := \max_{p(x)} I(x; y) \le h $$



1 given that $\lim_{\beta \to \infty} \tanh(\beta x) = \mathrm{sgn}(x)$



where I(x; y) is the mutual information between the input x, of distribution p(x),
and the output y. The limitation comes essentially from the architecture, since h
binary neurons can implement only $C_{h,n} \le 2^h$ of the theoretically possible
$2^h$ output states, and they show that

$$ C = \log_2 C_{h,n} \; \begin{cases} = h & \text{for } h \le n \\ < h & \text{for } h > n , \end{cases} $$

So, assuming $h \le n$, we see that the architecture doesn't impose any limitation (footnote 2)
and, for these binary neurons without noise, the upper bound C can be reached if,
and only if, the distribution of the outputs q(y) is fully factorized [ ], namely

$$ q(y) = \prod_{i=1}^{h} q(y_i) \quad \text{with} \quad q(y_i = \pm 1) = \frac{1}{2} \quad \forall i . \qquad (3) $$
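To make the bound tangible: for a noiseless deterministic map the conditional entropy of y given x vanishes, so I(x; y) equals the entropy of the outputs. The sketch below, with the hypothetical helper `output_entropy_bits` and made-up pattern sets, computes the empirical output entropy and shows that it reaches h bits only when all $2^h$ binary states are equally likely, as in (3).

```python
import math
from collections import Counter
from itertools import product

def output_entropy_bits(patterns):
    """Empirical entropy, in bits, of a list of output patterns."""
    counts = Counter(tuple(p) for p in patterns)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

h = 3
all_vertices = list(product([-1, 1], repeat=h))

# Fully factorized, unbiased outputs: every vertex appears exactly once.
uniform_bits = output_entropy_bits(all_vertices)

# A biased set of the same size (illustrative only): one vertex dominates.
biased = [(1, 1, 1)] * 5 + all_vertices[:3]
biased_bits = output_entropy_bits(biased)
```

The uniform set attains the capacity of h = 3 bits, while the biased set of the same size conveys less.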



With the help of this analysis we can set up a list of the desirable characteristics
for the function $f : \mathbb{R}^n \to \mathbb{R}^h$ (2) implemented by our layer of h Perceptrons:

• the output patterns should be (essentially) binary, i.e. $1 - |y_i| < \epsilon$;

• the map $f : \mathbb{R}^n \to \mathbb{R}^h$ should be injective and such that (3) holds;

• as a consequence the produced data will be statistically independent:

$$ E[y_{i_1} y_{i_2} \cdots y_{i_r}] = 0 \quad \forall\, i_1 \ne i_2 \ne \cdots \ne i_r , \quad \forall\, 1 \le r \le h $$

(and thus uncorrelated, $E[y_i y_j] = 0 \;\; \forall\, i \ne j$);

• it should accomplish dimensionality reduction, i.e. whenever possible $h \ll n$;

• it should be "learnable", i.e. it should be possible to find it by gradient descent
along an appropriate function of the weights.

The most demanding goal is satisfying (3), but it is not easy to find an algorithm
that does it directly. Several authors followed the equivalent path of maximizing the
mutual information I(x; y), e.g. the ICA algorithm [1]; see also [16] and references
therein. Our algorithm starts from a physical problem that leads naturally towards
the fulfillment of these requirements.

3 The Physical Problem 

Let's consider the problem of finding the stable equilibrium position of m equal,
point-like electric charges $Q_\nu$ within a cube of conductor. This problem is very
similar to the Thomson problem [21], where the charges are in a sphere. Thomson
posed it in 1904 and it is remarkably difficult to solve: exact solutions are known only
for a few values of m; see [ ]. From now on we will always consider our cube centered
at the origin and with side of length 2, i.e. the physical space available to the charges
is the 3-dimensional cube defined by

$$ H_3 = \{ y \in \mathbb{R}^3 : |y_i| \le 1 , \; i = 1, 2, 3 \} $$



2 We just recall that this is different from requiring that there is no information loss, which
depends on the source entropy S(x) and would require that $h \ge S(x)$.



the extension to the h-dimensional hypercube $H_h$ being obvious. In an ideal
conductor the m charges are free to move and their stable rest positions $y_\nu$ minimize
the Coulomb potential (footnote 3) [ ]

$$ U(y_1, y_2, \dots, y_m) = \sum_{\mu < \nu} \frac{Q_\mu Q_\nu}{|y_\mu - y_\nu|} , \qquad \mu, \nu = 1, \dots, m $$



$U(y_1, y_2, \dots, y_m)$ is a harmonic function [2] and thus doesn't have minima in an
open, convex set like $H_3$; hence the rest positions of the charges are on the border,
namely on the surface of the cube. Moreover we conjecture that, if the charges are
equal and their number is $m \le 2^3 = 8$, the only stable positions of the charges are on
cube vertices, as shown in Figure 2, which contains the minimum energy arrangements
for two, three, four and five charges (footnote 4).







Figure 2: Stable equilibrium configurations of point-like charges in a cubic box: particles
arrange themselves so as to maximize their reciprocal distances while minimizing the
Coulomb potential energy. Since they occupy the vertices they have (almost) binary
coordinates in the defined set $H_3$.



This problem easily generalizes from $\mathbb{R}^3$ to $\mathbb{R}^h$ provided that $U(y_1, y_2, \dots, y_m)$
remains harmonic, and this happens iff the distance between charges generalizes to

$$ |y_\mu - y_\nu| := \left[ (y_\mu - y_\nu) \cdot (y_\mu - y_\nu) \right]^{\frac{h-2}{2}} \qquad (4) $$

Also in this case the rest positions of the charges must be on the border of $H_h$, and
we generalize our conjecture: the charges have stable rest positions on the vertices
of $H_h$ and consequently (almost) binary coordinates.

We take inspiration from this physical problem to propose a self-organizing
algorithm for a layer of continuous Perceptrons. We map our set of m inputs in $\mathbb{R}^n$ to
point-like charges in $\mathbb{R}^h$, and these charges are bound to remain in the h-dimensional
hypercube. Subsequently we let this system evolve under Coulomb repulsion in $\mathbb{R}^h$,
minimizing its energy until it reaches equilibrium. Provided that our conjecture is
true and if $m \le 2^h$, the charges at rest will occupy the vertices of $H_h$ and thus have
binary coordinates, which means that this approach allows us to get a binary
representation of the input data as a natural consequence and without any further
constraint. We will also show that this process maximizes information.
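The physical picture can be tried out directly on free charges, before any network is involved. The sketch below relaxes unit charges inside the 3-dimensional cube by following the Coulomb forces and clipping each coordinate to [-1, 1]. The fixed step size, the number of iterations, the clipping and the function names are assumptions chosen for illustration, not the NR algorithm itself, which acts on the weights.

```python
import math

def coulomb_energy(points):
    """Sum of 1/r over all pairs of unit charges (ordinary 3-d Coulomb potential)."""
    energy = 0.0
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            energy += 1.0 / math.dist(points[a], points[b])
    return energy

def relax(points, step=0.1, iterations=5000):
    """Move the charges along the repulsive forces, keeping them inside the cube."""
    pts = [list(p) for p in points]
    for _ in range(iterations):
        # Compute all forces first, then update every charge simultaneously.
        forces = [[0.0] * len(p) for p in pts]
        for a in range(len(pts)):
            for b in range(len(pts)):
                if a == b:
                    continue
                d = math.dist(pts[a], pts[b])
                for k in range(len(pts[a])):
                    forces[a][k] += (pts[a][k] - pts[b][k]) / d ** 3
        for a in range(len(pts)):
            for k in range(len(pts[a])):
                moved = pts[a][k] + step * forces[a][k]
                pts[a][k] = max(-1.0, min(1.0, moved))  # stay inside the cube
    return pts

# Two charges starting near the center end up on opposite vertices.
start = [(0.1, 0.2, 0.3), (-0.1, -0.2, -0.3)]
final = relax(start)
```

For this two-charge start the repulsion has a constant sign along every axis, so both charges are driven into opposite corners of the cube and all their coordinates become binary.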

More in detail, given a set of m inputs $x_\nu \in \mathbb{R}^n$, $\nu = 1, 2, \dots, m$, of distribution
p(x), applying (2) we get m outputs $y_\nu \in \mathbb{R}^h$ that the hyperbolic tangent constrains



3 in Gaussian units: $\frac{1}{4 \pi \epsilon_0} = 1$

4 Despite several attempts we haven't been able to prove this formally, but numerical simulations
support the conjecture.



within the h-dimensional hypercube $H_h$. To treat inputs of different probability
$p(x_\nu)$ we postulate that the probability of an output $y_\nu$ is proportional to the energy
of a charge $Q_\nu$ in the electric field, i.e.

$$ q(y_\nu) \propto E(Q_\nu) = Q_\nu \sum_{\mu \ne \nu} \frac{Q_\mu}{|y_\mu - y_\nu|} \qquad (5) $$



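and the total energy of the system is:

$$ U = \frac{1}{2} \sum_{\nu} E(Q_\nu) = \sum_{\mu < \nu} \frac{Q_\mu Q_\nu}{|y_\mu - y_\nu|} \qquad (6) $$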

For the sake of simplicity most of the time we will assume that all inputs are
equiprobable, $p(x_\nu) = \frac{1}{m}$, and thus we will feel free to put $Q_\nu = 1$ for all m charges,
so that the function to minimize is the simplified Coulomb potential

$$ U(y_1, y_2, \dots, y_m) = \sum_{\mu < \nu} \frac{1}{|y_\mu - y_\nu|} , \qquad \mu, \nu = 1, \dots, m . \qquad (7) $$

This "energy" is the function that the NR learning algorithm minimizes by modifying the
elements of the weight matrix W by gradient descent, namely

$$ w_{ij} \to w_{ij} - \varepsilon \frac{\partial U}{\partial w_{ij}} \qquad (8) $$

$\varepsilon$ being a small positive constant.

Let us suppose that NR has been successfully applied and that the harmonic
function U has been minimized (more on this later). All the m charges have relaxed
in the minimum energy configuration and necessarily lie on the surface of $H_h$ and, if
$m \le 2^h$ and our conjecture is true, they sit precisely on the vertices of the hypercube
$H_h$. It follows that all coordinates of their positions $y_\nu$ are binary and satisfactorily
represent the outputs of h binary neurons.

With distance definition (4) we know that U is harmonic and Gauss's theorem
holds. We use these properties to show that the positions of our charges satisfy (3)
in the limit $n, m, h \to \infty$, when we can neglect the granularity of the charges and
assume that the charge distribution becomes continuous. A similar approach is
usually taken for idealized physical conductors, where one forgets the quantization of
electron charges since the single electron charge is considered negligible with respect
to the total charge on the conductor.

When the charges have relaxed in the minimum energy configuration we know
that there is no electric field within the conductor and that all charges lie on the
(hyper-)surface; moreover the spatial density of the charges must be constant in
the limit $n, m, h \to \infty$. It follows, given the $H_h$ structure (footnote 5), that every hyperplane
through the origin of $\mathbb{R}^h$ that doesn't hit any vertex of $H_h$ (to avoid
complications) cuts $H_h$ into two parts that contain the same number of vertices, since,
if vertex v belongs to one of the half-spaces, vertex -v must belong to the other
one. From the constancy of the spatial density of the charges it follows that the
two half-spaces must also contain exactly the same charge, one half of the total
charge on $H_h$. Since this result is valid for any hyperplane through the origin of


5 one can observe that if the charges sit on hypercube vertices they also lie on the hypersphere of
radius $h^{\frac{h-2}{2}}$, and continue the following proofs for the hypersphere



$\mathbb{R}^h$ it is true also for the h hyperplanes $y_i = 0$. This means that there are exactly
$\frac{m}{2}$ charges with $y_i = 1$ (remember all coordinates are binary) and the same number
with $y_i = -1$. In the language of our layer of Perceptrons, and since $m \to \infty$, this
means that the output distribution is such that

$$ q(y_i = \pm 1) = \frac{1}{2} \quad \forall i . $$

It's also easy to prove by induction that $q(y) = \prod_{i=1}^{h} q(y_i)$; we begin by showing that
$q(y_i, y_j) = q(y_i)\, q(y_j)$ for any pair of different coordinates $y_i$ and $y_j$. Let's suppose
we have cut our charge distribution into two equal parts by the hyperplane $y_i = 0$
and we consider the orthogonal hyperplane $y_j = 0$: it's easy to use the previous
argument to show that in all 4 subspaces so defined the charges must be equal to $\frac{m}{4}$,
and thus that for any choice of the values of $y_i$ and $y_j$ one gets $q(y_i, y_j) = \frac{1}{4}$ and
thus $q(y_i, y_j) = q(y_i)\, q(y_j)$. Let's now suppose
$q(y_{i_1}, y_{i_2}, \dots, y_{i_k}) = \prod_{j=1}^{k} q(y_{i_j}) = \frac{1}{2^k}$
for any choice of k variables $y_{i_1}, y_{i_2}, \dots, y_{i_k}$; it's easy to exploit the structure of $H_h$ to
show that, if one adds a (k + 1)-th coordinate, the hyperplane of equation $y_{i_{k+1}} = 0$
will cut all the previous charges into 2 halves, and thus that
$q(y_{i_1}, y_{i_2}, \dots, y_{i_k}, y_{i_{k+1}}) = \prod_{j=1}^{k+1} q(y_{i_j}) = \frac{1}{2^{k+1}}$,
completing the proof by induction. A technical point: we note
that only for $m = 2^h$ can one continue the induction chain up to step k = h,
giving $q(y) = 2^{-h}$ for any y and complete factorization of the distribution q(y); if
$m < 2^h$ one can only prove that all the moments of order k of q(y) are zero up to
$k = \lfloor \log_2 m \rfloor$.
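The technical point about $m < 2^h$ can be checked on a tiny case. The sketch below places m = 4 unit charges on alternating vertices of the 3-dimensional cube (a regular tetrahedron, chosen here only as an illustrative arrangement) and computes the moments $E[y_{i_1} \cdots y_{i_k}]$ of every order. The helper name `moment` is hypothetical.

```python
from itertools import combinations
from math import prod

def moment(patterns, coords):
    """Average over the patterns of the product of the selected coordinates."""
    return sum(prod(p[i] for i in coords) for p in patterns) / len(patterns)

# Four charges on alternating vertices of the cube (assumed arrangement).
charges = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
h = 3

# Moments of order k for every choice of k distinct coordinates.
moments_by_order = {
    k: [moment(charges, c) for c in combinations(range(h), k)]
    for k in range(1, h + 1)
}
```

All moments of order 1 and 2 vanish, in agreement with $\lfloor \log_2 4 \rfloor = 2$, while the single moment of order 3 equals 1, so the distribution is unbiased and uncorrelated but not fully factorized.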

We have thus proved that, if the m charges relax in the configuration of minimal
energy (which, by the way, is far from being unique given the many symmetries
of the system), the final positions of the charges satisfy all the requirements set for a
layer of Perceptrons at the end of the previous Section; in particular, the final
distribution is fully factorized (3), which implies that the information produced at the
output is maximal.

There is one point we left behind that deserves attention: we saw that $U(y_1, y_2, \dots, y_m)$
is a function of the mh charge coordinates and is provably harmonic, but in our
case, with (1), we can change the $y_i$ coordinates only through the (n + 1)h weights $w_{ij}$.
It is simple to verify that $U(w_{ij})$ is no longer harmonic:

$$ \frac{\partial U}{\partial w_{ij}} = \frac{\partial U}{\partial y_i} \frac{\partial y_i}{\partial w_{ij}} $$

$$ \frac{\partial^2 U}{\partial w_{ij}^2} = \frac{\partial^2 U}{\partial y_i^2} \left( \frac{\partial y_i}{\partial w_{ij}} \right)^2 + \frac{\partial U}{\partial y_i} \frac{\partial^2 y_i}{\partial w_{ij}^2} $$

and in general $\nabla^2 U(w_{ij}) = \sum_{ij} \frac{\partial^2 U}{\partial w_{ij}^2} \ne 0$. This means that the restrictions imposed
on the positions of the charges $y_\nu$ by the fact that they are defined by y = tanh(Wx)
(which, by the way, also enforces the constraints $y_\nu \in H_h$) render the energy no
longer harmonic in the "free" coordinates $w_{ij}$. This implies that we cannot formally
prove that the function $U(w_{ij})$ is without local minima and that gradient descent
(8) will always bring the system to one of the solutions we just described, essentially
because we cannot "move" the charge positions $y_\nu$ freely but only through the
variation of the weights $w_{ij}$.



One could argue that it is reasonable to expect that the characteristics of the
solution found won't change dramatically, especially if $m \ll 2^h$ and the charges are
very far from each other on $H_h$, but still the strength of a formal proof is lost. This
issue surely deserves further investigation and will be the subject of future
work.

We conclude this section with a brief review of other appearances of Coulomb-like
forces in the context of neural networks. The series started in 1987 when Bachmann
et al. [ ] proposed an associative memory that attached negative electrical charges
to the stored patterns, while the memory played the role of a positive charge attracted
by the patterns. In this fashion they could store unlimited patterns and the memory
didn't have any spurious state. This idea resurfaced 7 years later [lo].

After some years Marques and Almeida [12] proposed a feed-forward network
dedicated to the separation of nonlinear mixtures that minimized a function made of
three terms. The first term, W, was inspired by the idea of repulsion of equal charges
and produced a repulsive force. This force was non-physical, since the repulsion had
a finite range and acted only in the proximity of the patterns; the minimization of this
term tended to keep the patterns far apart, producing an approximately uniform
distribution of the patterns. To this term they had to add a term B, enforcing the
constraint that the outputs stay in [-1, 1] so that the patterns do not fly to infinity, and a
regularizing term R. This work has subsequently been analyzed in a mathematical
setting [20], where it has been shown that, within certain approximations, a repulsive
force decreasing faster than the Coulomb force tends to produce a uniform probability
density of the outputs, which in turn maximizes output entropy, which in turn minimizes
mutual information and is thus amenable to ICA.

None of these works has a real, physical Coulomb energy, which is instead
central in our approach, since it allows us to properly define a positive definite
probability density (10) and provides an energy that, at least in the ideal case,
is harmonic and thus gives important properties to the function to be minimized.
This kind of potential matches perfectly with the hypercube structure, since charges
tend to place themselves on the hypercube vertices, thus automatically satisfying the
other requirement of having binary coordinates. This produces a distribution of the
patterns that, microscopically, is highly non-uniform, being the discrete sum of
point-like charges. On the other hand, from a larger distance, this distribution
appears uniform thanks to Gauss's theorem (as happens in real conductors).

4 Analysis of the 1-dimensional Case 

We start by analyzing NR properties in a toy problem: a layer made of just one neuron
with one input, i.e. a purely one-dimensional problem. This is a well-studied case
[1, 14, 4] where theoretical analysis is simpler: here (1) becomes

$$ y = \tanh(w x + w_0) . \qquad (9) $$

Only for the analysis of this case we relax the condition of digital inputs, since
this would restrict us to the too simple case $x = \pm 1$. So here we suppose we have
continuous inputs x with probability distribution p(x). Correspondingly we have
continuous y with an electrical charge density $\rho(y)$, and the energy of the system (6)
becomes:

$$ U = \int\!\!\int \frac{\rho(y)\, \rho(y')}{|y - y'|} \, dy' \, dy ; $$

calling $\phi(y) := \int \frac{\rho(y')}{|y - y'|} \, dy'$ the total potential of point y, we have

$$ U = \int \rho(y)\, \phi(y)\, dy = \int q(y)\, dy \qquad (10) $$

where q(y) is the linear energy density, which is by definition positive since it is
proportional to the squared electric field [ ]. It is thus possible to extend (5) and to
interpret q(y) (suitably normalized) also as the probability density distribution of y.
Our problem is, given x and p(x), to determine the parameters w, $w_0$ that minimize
U.

We can gain insight into the actual solution of this problem by first examining the
corresponding physical problem: since our charges in y are to be imagined as free
charges in a conductor, this is the physical problem of the charge distribution on a
finite (remember $-1 < y < 1$), infinitely thin, conductive wire.

It is a typical electrostatic problem: one has to find the charge distribution $\rho(y)$
that minimizes U. In this particular case we are in a conductor and thus, when the
energy is minimized, the potential is constant, $\phi(y) = \phi_0$, and so mathematically the
problem is to find the charge distribution $\rho(y)$ that realizes this condition. This is
not an easy problem (it was the subject of James Clerk Maxwell's last scientific
paper, see in [10]) but it is known [0] that, as the ratio of the physical dimensions of
the wire goes to zero, the distribution of the charges on the wire $\rho(y)$ tends to a
uniform distribution, i.e. $\rho(y) \to \rho_0$. So we can conclude that the physical solution
that minimizes (10) gives $q(y) = \rho_0 \phi_0$.

This is true for the physical problem where, since the charges in the wire are free
to move, the distribution of charges $\rho(y)$ can take any shape. Conversely, it is clear
that in our case, where we can play only with the parameters w, $w_0$ to modify $\rho(y)$,
in general it will be impossible to find values of w, $w_0$ that realize the condition

$$ q(y) = \rho_0 \phi_0 . $$

But let us suppose that we are in this lucky case; to understand what it
means for our problem we use the well-known relation for the transformation
of a distribution p(x) when the variable x is transformed to $y = f_w(x)$, where w
represents the parameters of the function f(), which has to be invertible. In this case
the distribution q(y) of y is given by

$$ q(y) = \frac{p(x)}{\left| \frac{dy}{dx} \right|} $$

and this relation tells us that to get a constant q(y) necessarily $\left| \frac{dy}{dx} \right| \propto p(x)$,
and thus the function $y = f_w(x)$ needs to be proportional to the primitive of the
probability distribution of x, namely

$$ f_w(x) \propto \int p(x)\, dx \qquad (11) $$

and it's well known that this represents the maximum entropy solution for our
one-neuron net [1]. So, if by adjusting w and $w_0$ we can obtain that (11) indeed holds, our
system minimizes the energy (10) and this solution also gives the maximum information.
In our case (9) one obtains:

$$ \tanh'(w x + w_0)\, |w| \propto p(x) $$

where we used the fact that $\tanh'(x) > 0$, and this relation can also be interpreted
as giving the only possible p(x) for which we get the optimal solution. As pointed out
by one of the referees, this can be a severe limitation, which one could remedy by
adapting not just the weights but, as done in [14], the transfer function $f_w(x)$ itself.
This would produce a more powerful neuron but, following Bell and Sejnowski's
ICA, we purposely decided not to open this Pandora's jar at this stage.
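The matching condition can be verified numerically in the lucky case. The sketch below assumes an input density of exactly the required form, $p(x) = \frac{|w|}{2} \tanh'(w x + w_0)$, which integrates to 1 over the real line, and evaluates $q(y) = p(x) / |dy/dx|$ at a few points. The function names and the sample values of w, $w_0$ and x are hypothetical.

```python
import math

def matched_density(x, w, w0):
    """Input density p(x) = |w| * tanh'(w x + w0) / 2 (assumed for this example)."""
    y = math.tanh(w * x + w0)
    return abs(w) * (1.0 - y * y) / 2.0

def output_density(x, w, w0, p):
    """q(y) = p(x) / |dy/dx| for y = tanh(w x + w0)."""
    y = math.tanh(w * x + w0)
    dy_dx = w * (1.0 - y * y)
    return p(x, w, w0) / abs(dy_dx)

w, w0 = 1.5, -0.4
q_values = [output_density(x, w, w0, matched_density)
            for x in (-2.0, -0.5, 0.0, 0.7, 3.0)]
```

Every value equals 1/2, i.e. the output is uniformly distributed on (-1, 1), which is the constant q(y) discussed above.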

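Now we analyze what happens in the general case when (11) can't be satisfied
exactly and the best one can do is to find the values of the parameters w that
minimize U, i.e. we study

$$ \frac{\partial U}{\partial w} = \frac{\partial}{\partial w} \int \frac{p(x)}{\left| \frac{dy}{dx} \right|} \, dx = \int p(x) \, \frac{\partial}{\partial w} \left| \frac{dy}{dx} \right|^{-1} dx \qquad (12) $$

where we applied Leibniz's rule for differentiation under the integral, since we are
dealing with continuous functions. We observe that the only term that depends on
w, and is thus affected by the derivative, is $\left| \frac{dy}{dx} \right|$.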

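We conclude this Section by showing that the learning rules for our network (9),
obtained from (12), are equivalent to Bell and Sejnowski's ICA [ ]. We start
by performing the derivation with respect to w and $w_0$:

$$ -\frac{\partial U}{\partial w} = \int \frac{p(x)}{\left[ f_w'(w x + w_0)\, |w| \right]^2} \left[ f_w''(w x + w_0)\, |w|\, x + f_w'(w x + w_0)\, \frac{|w|}{w} \right] dx $$

$$ -\frac{\partial U}{\partial w_0} = \int \frac{p(x)}{\left[ f_w'(w x + w_0)\, |w| \right]^2} \left[ f_w''(w x + w_0)\, |w| \right] dx $$

with our choice $y = \tanh(w x + w_0)$, then

$$ \begin{cases} f_w'(w x + w_0) = 1 - y^2 > 0 \\ f_w''(w x + w_0) = -2 y\, (1 - y^2) \end{cases} \qquad (13) $$

that substituted in the previous equations give

$$ -\frac{\partial U}{\partial w} = \int \frac{p(x)}{\left[ (1 - y^2)\, |w| \right]^2} \left[ -2 y\, (1 - y^2)\, |w|\, x + (1 - y^2)\, \frac{|w|}{w} \right] dx = \int \frac{p(x)}{(1 - y^2)\, |w|^2} \left[ -2 y\, |w|\, x + \frac{|w|}{w} \right] dx = \int \frac{p(x)}{(1 - y^2)\, |w|} \left[ \frac{1}{w} - 2 y x \right] dx $$

$$ -\frac{\partial U}{\partial w_0} = \int \frac{p(x)}{\left[ (1 - y^2)\, |w| \right]^2} \left[ -2 y\, (1 - y^2)\, |w| \right] dx = \int \frac{p(x)}{(1 - y^2)\, |w|^2} \left[ -2 y\, |w| \right] dx = \int \frac{p(x)}{(1 - y^2)\, |w|} \left[ -2 y \right] dx $$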



Comparing these equations with (10) we note that the term $\int \frac{p(x)}{(1 - y^2)\, |w|}\, dx$ is
nothing but the Coulomb energy U integrated over x, and hence, as anticipated, it
is possible to interpret it as a distribution over which the terms in square brackets
can be considered averaged, so we can also write them as expectation values:

$$ -\frac{\partial U}{\partial w} = E_U\!\left[ \frac{1}{w} - 2 y x \right] , \qquad -\frac{\partial U}{\partial w_0} = E_U\left[ -2 y \right] $$

and comparing these relations with ICA's learning rules [4] (remembering that we use
slightly different transfer functions), we see that they are equal. This shows that
NR and ICA are intimately related and that, even if they start from completely
different starting points, essentially they both end up maximizing information.
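The terms in square brackets can be turned into a per-sample update for the single neuron (9). The sketch below is a minimal illustration, assuming a plain stochastic step on one sample in place of the energy-weighted expectation $E_U$; the function name `nr_one_neuron_step` and the learning rate are hypothetical.

```python
import math

def nr_one_neuron_step(w, w0, x, rate=0.01):
    """One update of (w, w0) using the bracketed terms 1/w - 2*y*x and -2*y.

    Assumption: a single sample x replaces the expectation over the data.
    """
    y = math.tanh(w * x + w0)
    dw = 1.0 / w - 2.0 * y * x
    dw0 = -2.0 * y
    return w + rate * dw, w0 + rate * dw0

# At x = 0 with w0 = 0 the output is 0, so only the 1/w term acts on w.
w_new, w0_new = nr_one_neuron_step(1.0, 0.0, 0.0, rate=0.1)

# A sample far in the saturated region shrinks w and shifts the threshold.
w_b, w0_b = nr_one_neuron_step(1.0, 0.0, 2.0, rate=0.1)
```

The 1/w term keeps the weight from collapsing to zero, while the terms proportional to y pull the operating point back whenever the neuron saturates.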



5 The Multidimensional Case 

We now proceed to examine the general multidimensional case: we start with m
binary inputs of n bits each (that in our numerical simulations will be binary images)

$$ x_\nu \in \{\pm 1\}^n , \qquad \nu = 1, 2, \dots, m $$

fed to a layer of h neurons, thus producing, for each input,

$$ y_\nu = \tanh(W x_\nu) \in (-1, 1)^h , \qquad \nu = 1, 2, \dots, m $$

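where the dimensionality of the output layer h is a quite arbitrary choice: it
represents somehow the compression rate of the system (footnote 6). To each output $y_\nu$ produced
we attach an arbitrary unitary electric charge. Then we calculate the Coulomb
potential (7) and apply gradient descent to it to obtain the learning rules. With the
standard distance definition (4) in h-dimensional space we get

$$ |y_\mu - y_\nu| = \left[ \sum_{i=1}^{h} (y_{\mu i} - y_{\nu i})^2 \right]^{\frac{h-2}{2}} $$

that gives the learning rule for h > 2

$$ \Delta w_{ij} = -\frac{\partial U}{\partial w_{ij}} = -\frac{\partial}{\partial w_{ij}} \sum_{\mu < \nu} \frac{1}{|y_\mu - y_\nu|} = (h - 2) \sum_{\mu < \nu} \frac{(y_{\mu i} - y_{\nu i}) \left[ x_{\mu j}\, (1 - y_{\mu i}^2) - x_{\nu j}\, (1 - y_{\nu i}^2) \right]}{|y_\mu - y_\nu|^{\frac{h}{h-2}}} \qquad (14) $$

where we used the properties (13) of the hyperbolic tangent.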
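A closed-form gradient of this kind is easy to get wrong, so it is worth comparing it against finite differences. The sketch below does this for a tiny hypothetical network (h = 3 neurons, 3 inputs including a clamped first input, m = 3 patterns). It uses the form of rule (14) obtained by differentiating (7) with the distance (4). All names, weights and patterns are made up for the check.

```python
import math

def outputs(W, X):
    """y_nu = tanh(W x_nu) for every input pattern in X."""
    return [[math.tanh(sum(w * x for w, x in zip(row, xv))) for row in W] for xv in X]

def energy(W, X):
    """Simplified Coulomb potential (7) with the distance (4)."""
    h = len(W)
    Y = outputs(W, X)
    total = 0.0
    for mu in range(len(Y)):
        for nu in range(mu + 1, len(Y)):
            sq = sum((a - b) ** 2 for a, b in zip(Y[mu], Y[nu]))
            total += sq ** ((2 - h) / 2)
    return total

def minus_gradient(W, X, i, j):
    """-dU/dw_ij in closed form, i.e. the right-hand side of rule (14)."""
    h = len(W)
    Y = outputs(W, X)
    total = 0.0
    for mu in range(len(Y)):
        for nu in range(mu + 1, len(Y)):
            sq = sum((a - b) ** 2 for a, b in zip(Y[mu], Y[nu]))
            bracket = (X[mu][j] * (1 - Y[mu][i] ** 2)
                       - X[nu][j] * (1 - Y[nu][i] ** 2))
            # |y_mu - y_nu|^(h/(h-2)) equals sq^(h/2) with the distance (4).
            total += (Y[mu][i] - Y[nu][i]) * bracket / sq ** (h / 2)
    return (h - 2) * total

def numeric_minus_gradient(W, X, i, j, eps=1e-6):
    """Central finite-difference estimate of -dU/dw_ij."""
    up = [row[:] for row in W]
    down = [row[:] for row in W]
    up[i][j] += eps
    down[i][j] -= eps
    return -(energy(up, X) - energy(down, X)) / (2 * eps)

X = [[1, 1, 1], [1, 1, -1], [1, -1, 1]]          # first entry: clamped input
W = [[0.1, 0.5, -0.3], [-0.2, 0.4, 0.6], [0.3, -0.7, 0.2]]
pairs = [(minus_gradient(W, X, i, j), numeric_minus_gradient(W, X, i, j))
         for i in range(3) for j in range(3)]
```

Each pair holds the analytic and the numerical value of $-\partial U / \partial w_{ij}$ for one weight. The two agree to within the accuracy of the finite-difference estimate.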

We used the only possible definition of the distance $|y_\mu - y_\nu|$ that renders the
energy U harmonic in the mh variables $y_{\nu i}$, but this is of little use for us since in
general U is not harmonic with respect to our "free" variables $w_{ij}$.

We have thus felt free to try another definition for the distance with the
objective of obtaining a faster learning algorithm. For this reason we considered the
expression

$$ |y_\mu - y_\nu|_H := \left[ 2\, (h - y_\mu \cdot y_\nu) \right]^{\frac{h-2}{2}} = \left( 2h - 2 \sum_{i=1}^{h} y_{\mu i}\, y_{\nu i} \right)^{\frac{h-2}{2}} $$



that is a distance in the mathematical sense; indeed it is a slightly modified version of
the so-called Hamming distance, which is a measure of the difference between two
strings of equal length (footnote 7). With this new distance plugged in (7) we define a slightly



6 as proposed in [13] one can distinguish 3 cases:

- h < S(x): here the net must "compress" the data with some information loss;

- h = S(x): here the net is perfectly matched to the incoming information;

- h > S(x): here the net is redundant but, as explained later, with NR this redundancy can be used
for error correction.

7 in our notation the Hamming distance between binary vectors $y_\mu, y_\nu \in \{\pm 1\}^h$ is
$\frac{1}{2} (h - y_\mu \cdot y_\nu)$



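different energy function $U_H$ that still diverges when any two charges get too near
to each other. Minimizing $U_H$ the learning rule becomes

$$ \Delta w_{ij} = -\frac{\partial U_H}{\partial w_{ij}} = -\frac{\partial}{\partial w_{ij}} \sum_{\mu < \nu} \frac{1}{|y_\mu - y_\nu|_H} \propto -\sum_{\mu < \nu} \frac{x_{\mu j}\, y_{\nu i}\, (1 - y_{\mu i}^2) + x_{\nu j}\, y_{\mu i}\, (1 - y_{\nu i}^2)}{|y_\mu - y_\nu|_H^{\frac{h}{h-2}}} \qquad (15) $$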

that is similar to the previous rule (14), with the only difference that it contains only the
"crossed" Hebbian terms $x_{\mu j} y_{\nu i}$ and $x_{\nu j} y_{\mu i}$ without the subtraction of the "straight"
terms $x_{\mu j} y_{\mu i}$ and $x_{\nu j} y_{\nu i}$, and that in numerical simulations appears indeed to be
faster.

This modified Hamming distance can be easily related to the Euclidean distance
(4) by observing that, since the output of the hyperbolic tangent is in $(-1, 1)$, it follows
that $0 \le y^2 \le h$ and so

$$ |y_\mu - y_\nu| = \left[ (y_\mu - y_\nu) \cdot (y_\mu - y_\nu) \right]^{\frac{h-2}{2}} = \left[ y_\mu^2 + y_\nu^2 - 2\, y_\mu \cdot y_\nu \right]^{\frac{h-2}{2}} \le \left[ 2\, (h - y_\mu \cdot y_\nu) \right]^{\frac{h-2}{2}} = |y_\mu - y_\nu|_H \qquad \forall\, y \in H_h $$

and the Euclidean and the Hamming distances coincide if, and only if, each
component of each output vector is binary, which is basically what we hope to get at
equilibrium. In terms of the energy we can thus write

$$ U(y_1, y_2, \dots, y_m) \ge U_H(y_1, y_2, \dots, y_m) \qquad \forall\, y \in H_h \qquad (16) $$

and we see then that the energy defined with the Hamming distance is a lower
bound for the energy defined making use of the Euclidean one. In principle, thus,
at equilibrium we can expect the two energies to be equal.
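The inequality between the two distances, and their equality on the vertices, can be checked numerically. The sketch below uses hypothetical helper names, h = 5 (so that the exponent (h - 2)/2 is positive) and randomly drawn points inside the hypercube.

```python
import random

def euclid_distance(a, b):
    """Distance (4): squared Euclidean norm raised to the power (h - 2)/2."""
    h = len(a)
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** ((h - 2) / 2)

def hamming_like_distance(a, b):
    """Modified Hamming distance: [2 (h - a.b)] raised to the power (h - 2)/2."""
    h = len(a)
    return (2 * (h - sum(p * q for p, q in zip(a, b)))) ** ((h - 2) / 2)

random.seed(0)
h = 5
inside = [[random.uniform(-1, 1) for _ in range(h)] for _ in range(20)]

# Difference between the two distances for every ordered pair of interior points.
gaps = [hamming_like_distance(a, b) - euclid_distance(a, b)
        for a in inside for b in inside if a is not b]

# Two hypercube vertices: here the two definitions give the same value.
v1, v2 = [1, -1, 1, 1, -1], [-1, -1, 1, -1, 1]
```

All the gaps are non-negative, and on the two vertices both definitions give the same number, as expected from the argument above.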

Learning rules (14) and (15) share two characteristics: the first is that they are
Hebbian, since they are perfectly local in the sense that the synapse $w_{ij}$ connecting
neuron $y_i$ to input $x_j$ is updated only with the values taken by these neurons. At
the same time the value of the synapse is updated by the product $x_j y_i$ referring only
to different patterns: in other words, to update a synapse one needs the "history"
of the two neurons; one could say that the rule is local in space but non-local in
time. The second interesting characteristic is that in both rules there appear the terms
$(1 - y_{\nu i}^2)$ that tend to kill the learning when $|y_{\nu i}| \approx 1$, i.e. when the coordinates are
substantially binary; this inhibits the weights from growing indefinitely.

We conclude this Section by observing that the outputs produced by this network
are suited to implement error detection and correction; in other words the injective
map $f : \mathbb{R}^n \to \mathbb{R}^h$ (2) implemented by our network de facto acts as an encoder that
realizes a block (m, h) code, see e.g. [5]. Let's suppose that $m < 2^h$, i.e. there are fewer
patterns $y_\nu$ than hypercube vertices to park them, and that U has been minimized.
Given the form of the energy minimized by learning (7) we know that each charge
$y_\nu$ will be on a hypercube vertex and as far as possible from all other charges. Let us
suppose that the minimum Hamming distance between different charges $y_\nu$ is d: it's
well known that in this case one can detect up to d - 1 errors on the patterns y and
correct up to $\lfloor \frac{d-1}{2} \rfloor$ errors. For example in the numerical simulations of the next
Section, for m = 7 and h = 64, the minimum Hamming distance between different
patterns is larger than d = 36.
This means that if one is given a noisy version $y'_\nu$ of the pattern $y_\nu$ (for example
as returned by an associative memory) one can try to restore the original pattern.
By the way, the restoration could be done by minimizing again the potential energy
U, which is no longer minimal when the correct pattern $y_\nu$ is replaced by its noisy
version $y'_\nu$ that ends up "out of place".
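The coding argument can be illustrated with a toy codebook. The sketch below computes the minimum Hamming distance d and the number of correctable errors, then restores a noisy pattern. Note that it restores by plain nearest-codeword decoding, a standard technique swapped in here for simplicity, not by the energy minimization suggested in the text. The codebook and all names are made up.

```python
def hamming(a, b):
    """Number of positions in which two binary patterns differ."""
    return sum(1 for p, q in zip(a, b) if p != q)

def min_distance(codebook):
    """Minimum Hamming distance over all pairs of distinct codewords."""
    return min(hamming(a, b)
               for k, a in enumerate(codebook) for b in codebook[k + 1:])

def decode(noisy, codebook):
    """Index of the codeword closest to the noisy pattern."""
    return min(range(len(codebook)), key=lambda k: hamming(noisy, codebook[k]))

# Hypothetical codebook: m = 3 patterns on the vertices of the h = 6 hypercube.
codebook = [
    [1, 1, 1, 1, 1, 1],
    [-1, -1, -1, -1, -1, -1],
    [1, 1, 1, -1, -1, -1],
]
d = min_distance(codebook)
correctable = (d - 1) // 2

# The third codeword with one flipped bit is restored correctly.
noisy = [1, 1, -1, -1, -1, -1]
restored = decode(noisy, codebook)
```

Here d = 3, so one error per pattern can be corrected, and the noisy pattern is mapped back to the third codeword.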

6 Preliminary Numerical Results 

We start by introducing the problem we tackled to test NR, namely the preprocessing
of real-world data to build a binary, uncorrelated representation. We had in mind
the preprocessing of binary images for an associative memory, but this task is by no
means limited to this particular problem.

Associative memories have been one of the first applications of the neural network
paradigm: introduced in 1969 by David Willshaw et al. [ ], they have produced
many offspring: see e.g. the classical book [6] and references therein, or, for a more
recent review, see [11], which embeds all flavours of associative memories in a unique
Bayesian frame. We focus on the (classical) family of associative memories made of
a network of n McCulloch and Pitts neurons, each of them updating its state $s_i \to s'_i$
with the standard rule

$$ s'_i = t\Big( \sum_{j} w_{ij}\, s_j \Big) \qquad (17) $$

where the transfer function t(x) can be either smooth, e.g. t(x) = tanh(x), or binary,
t(x) = sgn(x). Different kinds of associative memories sport different connection
schemes and different rules for the synapses $w_{ij}$, but all models agree on the fact that
the information is stored in the synapses. An associative memory storing m patterns
$\xi_\nu$, $\nu = 1, \dots, m$, should be able to find any of the stored patterns starting from
a partial or noisy cue. More precisely, if the network is initially in state $S_0$, the
(repeated) application of (17) should bring the network to one of the stored states,
i.e. $S_0 \to S = \xi_\nu$.
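As a minimal illustration of the update rule (17) with t(x) = sgn(x), the sketch below stores one pattern and recalls it from a corrupted cue. The Hebbian outer-product prescription for the synapses with zero self-connections, the convention sgn(0) = 1 and all names are assumptions made for this example, since the text leaves the synaptic rule unspecified.

```python
def sgn(value):
    """Binary transfer function; the value at 0 is set to +1 by convention here."""
    return 1 if value >= 0 else -1

def hebbian_weights(patterns):
    """Outer-product (Hebbian) synapses with zero self-connections (assumed rule)."""
    n = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns)
             for j in range(n)] for i in range(n)]

def update(state, weights):
    """One synchronous application of rule (17) with t(x) = sgn(x)."""
    return [sgn(sum(w_ij * s_j for w_ij, s_j in zip(row, state))) for row in weights]

stored = [1, -1, 1, 1, -1]
weights = hebbian_weights([stored])

# Cue: the stored pattern with its third bit flipped.
cue = [1, -1, -1, 1, -1]
recalled = update(cue, weights)
```

With a single stored pattern, one application of the rule is enough to correct the flipped bit.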

A common simplification easing analytical calculations is that of assuming the
distribution of the stored patterns to be fully factorized and unbiased:

$$ p(\xi) = \prod_{i} p(\xi_i) \quad \text{with} \quad p(\xi_i = \pm 1) = \frac{1}{2} \quad \forall i \qquad (18) $$

which implies that the patterns are statistically independent and binary. This requirement
is exacting and, if it is strictly respected, immediately rules out all real-world data,
like for example binary images or sparse coded data.

So to deal with these data one first needs to transform them into data that
fulfill these requirements. The simplest transformations are the linear ones, and if
one is content with uncorrelated (and not independent) data then the linear
transformation known as Principal Component Analysis can do the job. Unfortunately
the transformed patterns are no longer binary, and it is an open problem to
find a linear transformation that produces uncorrelated and binary data (see e.g. [19]
or [17], an exact solution being in general impossible (footnote 8)). So to end up with binary
data one must give up one of the constraints: uncorrelation or linearity of the
transformation.



3 the covariance matrix has integer elements but this is not true for its eigenvectors. 
Here we abandon the requirement of a linear transformation, but in doing so we can at
the same time raise our other goal: we will produce data that are not just uncorrelated
but independent, while at the same time remaining binary. More precisely, given $m$
$n$-dimensional binary images $\xi_\nu$, we look for a function $f$ such that

$$y_\nu = f(\xi_\nu) \qquad \text{with } h < n$$

and the outputs $y_\nu$ represent the preprocessed patterns, which should be statistically
independent and thus ready to be stored in an associative memory of $h$ neurons. At
this point it is clear that (2), as obtained by NR and satisfying (3), is tailored for the
job.

Before presenting numerical results we mention an additional complication:
associative memories usually do not recall the stored patterns $y_\nu$ exactly
but return a pattern $S = y'_\nu$ with $y'_\nu \simeq y_\nu$, the difference being typically
a few percent of the bits. If one wants to be able to get back the original image $\xi_\nu$
from $y'_\nu$, this imposes further requirements on the characteristics of the preprocessing
and, at the same time, rules out standard algorithms for binary compression, which
produce statistically fragile data. As explained in the previous Section, NR, by providing
data that are as far apart as possible in $\mathbb{R}^h$, can also fulfill this requirement.

We ran a preliminary numerical test on a set of $m = 7$ binary images of $33 \times 33$
pixels; we used a network of $h = 64$ neurons with $n = 33 \times 33 + 1 = 1{,}090$
inputs, totalling 69,760 weights. We performed two different learning runs with the two
gradient descent rules (14) and (15) of the previous Section. The program stopped
when $\max_{i,j}\{\Delta w_{ij}\} < 10^{-5}$, which required of the order of $10^7$ steps.
Each simulation took several days on an Intel Core Duo 2.93 GHz processor, indicating
that there is ample room for improvement, e.g. by taking advantage of standard
electrostatic relaxation algorithms.
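The overall shape of such a run can be sketched on a toy problem. The rules (14) and
(15) are defined in an earlier Section and are not reproduced here, so the code below
is only an assumed stand-in: it takes a plain Coulomb energy (sum over pairs of inverse
Euclidean distances) for outputs $y = \tanh(Wx)$, uses ordinary gradient descent, and
applies the same kind of stopping test on the largest weight change. The function
names, learning rate, initial weight scale and toy inputs are all hypothetical.

```python
import numpy as np

def coulomb_energy(y):
    """Sum over pairs of 1/|y_mu - y_nu| (assumed form of the energy)."""
    diff = y[:, None, :] - y[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    iu = np.triu_indices(len(y), k=1)
    return (1.0 / dist[iu]).sum()

def relax(x, h, lr=0.001, tol=1e-5, max_steps=5000, seed=0):
    """Gradient descent on the energy of y = tanh(W x); stops when max|dW| < tol."""
    rng = np.random.default_rng(seed)
    m, n = x.shape
    w = rng.normal(scale=0.5, size=(h, n))
    energies = []
    for _ in range(max_steps):
        y = np.tanh(x @ w.T)
        energies.append(coulomb_energy(y))
        diff = y[:, None, :] - y[None, :, :]
        # Pad the diagonal so the self-terms (where diff is zero) give 0 instead of 0/0.
        dist = np.sqrt((diff ** 2).sum(axis=-1)) + np.eye(m)
        grad_y = -(diff / dist[..., None] ** 3).sum(axis=1)   # dU/dy_mu
        grad_w = ((1.0 - y ** 2) * grad_y).T @ x              # chain rule through tanh
        dw = -lr * grad_w
        w += dw
        if np.abs(dw).max() < tol:
            break
    return w, np.array(energies)

# Toy inputs: the binary codes of 0..m-1 on n bits, mapped to +/-1 (all distinct).
m, n, h = 6, 4, 3
x = 2.0 * ((np.arange(m)[:, None] >> np.arange(n)) & 1) - 1.0
w, energies = relax(x, h)
y = np.tanh(x @ w.T)
```

Note that each weight update sums over all patterns, which is the reason a single step
is costly; the sketch makes no attempt to reproduce the timings quoted above.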

Figure 3 shows the energy decrease during learning for both the Euclidean, $U$,
and the Hamming distance, $U_H$: the first impression is that, as one could expect,
the decrease is compatible with a typical electrostatic potential; also $U > U_H$, as
foreseen. In this first run the expected convergence $U \to U_H$ was not observed,
but there are indications that the minimization of $U$ had not terminated.

Our aim was to obtain data that are both statistically independent and binary. Checking
the latter property is easier, since we just have to check whether the patterns $y_\nu$ rest on
hypercube vertices. This can be seen from Figure 4, which shows a histogram of the
values of the coordinates $y_{\nu i}$ (obtained by minimizing $U_H$) and confirms that this
is true, as expected.
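A check of this kind takes only a few lines. The sketch below uses made-up output
values (not the paper's data) and a hypothetical helper `fraction_on_vertices`, with an
arbitrary tolerance `eps`, to histogram the coordinates and measure how many lie close
to $\pm 1$.

```python
import numpy as np

def fraction_on_vertices(y, eps=0.05):
    """Fraction of coordinates within eps of +1 or -1."""
    return float(np.mean(np.abs(np.abs(y) - 1.0) < eps))

# Hypothetical outputs: saturated tanh values plus one unsaturated coordinate.
y = np.tanh(np.array([[4.0, -5.0, 3.5],
                      [-4.2, 0.3, 6.0]]))

counts, edges = np.histogram(y, bins=20, range=(-1.0, 1.0))
frac = fraction_on_vertices(y)   # 5 of the 6 coordinates are near a vertex
```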



Figure 4: Histogram of the values of the $y_{\nu i}$ coordinates, showing that most of them are on
hypercube vertices.
Figure 3: Evolution of the potential energy: behaviour of the energy during the run for
both the Euclidean distance, $U$, and the Hamming distance, $U_H$. As can be seen, the
latter is smaller than the former, as predicted by (16). The x axis represents the running
step in units of $10^5$ elementary steps.



Verifying the independence of the data (3) with the reduced statistics of this simulation
is a challenging task. A necessary condition is that the marginal distributions satisfy
$p(y_i = \pm 1) = \frac{1}{2}$, i.e. that each neuron cuts the input data set $\{x_\nu\}$
exactly in two parts. In our simulation this is perfectly achieved, since we got
$\frac{m \times h}{2} = \frac{7 \times 64}{2} = 224$ positive coordinates and 224 negative
ones. Moreover each of the $h = 64$ output neurons has, for the $m = 7$ inputs, exactly
3 positive and 4 negative coordinates (or vice versa), suggesting that with a larger
(and even) number of initial examples each neuron would have $m/2$ positive and
$m/2$ negative coordinates.
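The balance test described above can be written as a short routine. The sketch below
runs it on a small synthetic array with $m = 7$ patterns and only 4 output neurons,
constructed by hand so that neurons alternately have 3 and 4 positive coordinates; it
is an illustration of the bookkeeping, not a reproduction of the simulation, and
`sign_balance` is a hypothetical helper name.

```python
import numpy as np

def sign_balance(y):
    """Return (total positives, total negatives, positives per output neuron)."""
    s = np.sign(y)
    per_neuron = (s > 0).sum(axis=0)
    return int((s > 0).sum()), int((s < 0).sum()), per_neuron

# Synthetic illustration (not the paper's data): m = 7 patterns, h = 4 neurons.
m, h = 7, 4
y = -np.ones((m, h))
for i in range(h):
    k = 3 if i % 2 == 0 else 4      # neurons alternately get 3 or 4 positives
    y[:k, i] = 1.0

pos, neg, per_neuron = sign_balance(y)

# With the values quoted in the text, the balanced count is m*h/2 = 7*64/2.
expected_in_paper = 7 * 64 // 2
```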

To investigate the quality of the solution we analyzed the relative distances of
the output data $y_\nu$, since one can expect that, once (7) has been minimized, all
relative distances are equal, indicating a roughly constant hypersurface charge
distribution. We did this by calculating the $m \times m$ matrix of elements $y_\nu \cdot y_\mu$
which, when the $y_\nu$ sit on hypercube vertices, substantially represents the distance.
To make it easier to read we converted these values to a grayscale ($-h \to$
white, $h \to$ black); the result is shown in Figure 5. We can conclude that the $m$
outputs are substantially equally spaced, particularly in the second case.
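The reason the scalar product can stand in for the distance is that, for two vertices of
the $\pm 1$ hypercube, $y_\nu \cdot y_\mu = h - 2 d_H(y_\nu, y_\mu)$, where $d_H$ is the
Hamming distance. The sketch below verifies this identity on random vertices (chosen
for illustration only, not the outputs of the simulation) and applies a linear grayscale
mapping; the helper names and the convention 1.0 = white, 0.0 = black are assumptions.

```python
import numpy as np

def gram_and_hamming(y):
    """Scalar products y_nu . y_mu and pairwise Hamming distances for +/-1 rows."""
    gram = y @ y.T
    hamming = (y[:, None, :] != y[None, :, :]).sum(axis=-1)
    return gram, hamming

def to_gray(gram, h):
    """Map -h to 1.0 (white) and +h to 0.0 (black), linearly in between."""
    return (h - gram) / (2.0 * h)

rng = np.random.default_rng(2)
m, h = 7, 64
y = rng.choice([-1, 1], size=(m, h))    # random hypercube vertices
gram, hamming = gram_and_hamming(y)
gray = to_gray(gram, h)                 # diagonal is black: y_nu . y_nu = h
```

Equally spaced outputs would show up as a uniform gray level in all off-diagonal
entries of `gray`.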

7 Conclusions 

We presented a new approach to the problem of data preprocessing by a layer of
Perceptrons: we treat each data vector as a point-like electric charge confined in
an $h$-dimensional hypercube, subject to simple Coulomb repulsive forces. We then
let the system evolve as if it were a real physical system, that is, until it reaches the
minimum of the electrostatic energy. At this point, we expect that the charges will
occupy the hypercube's vertices and will be as far as possible from each other.

The potential energy function to minimize is continuous (since the transfer
function $\tanh(x)$ is), well shaped and, as far as we know, free of the relative minima
that plague so many cases in neural networks. For these reasons it is sensible in this
case to implement a simple gradient descent, which produces a strictly local learning
rule that is very similar to a Hebb rule, with the difference that to update a synapse
one needs all the data and not just the last seen one.

Figure 5: Matrices of scalar products $y_\nu \cdot y_\mu$ (converted to grayscale) for the systems
defined by the Euclidean (a) and the Hamming (b) distance; the outputs are substantially
equally spaced in both cases. From a computational point of view it turned out
that the NR version that made use of the Hamming distance converged faster than the
other: this may suggest, as expected, that it succeeds in providing a greater gradient.

In our tests this learning algorithm does not shine for its speed, but one can speculate
that actual calculations could use a more refined minimization of the
potential $U$, exploiting the relaxation techniques routinely used for similar
electrostatic problems.

Even with a continuous transfer function, at the end one obtains binary and
statistically independent data, which in turn guarantee that the entropy of the output
is maximized.

Another characteristic of this network is that one can freely choose the number
$h$ of output neurons without any adjustment of the learning algorithm. For small
values of $h$ the network implements compression of the incoming data; for larger $h$,
just a dimensional reduction without any information loss. For even larger values of
$h$ one introduces redundancy in the data, useful for subsequent error correction.

Despite some encouraging results, we feel that there is still ample room for further
theoretical and computational developments.